ABSTRACT

We propose to describe the variety of galaxies from the Sloan Digital Sky Survey (SDSS) by using
only one affine parameter. To this aim, we construct the Principal Curve (P-curve) passing through
the spine of the data point cloud, considering the eigenspace derived from Principal Component
Analysis (PCA) of morphological, physical and photometric galaxy properties. Thus, galaxies can
be labeled, ranked and classified by a single arc length value of the curve, measured at the unique
closest projection of the data points on the P-curve. We find that the P-curve has a "W" letter shape
with 3 turning points, defining 4 branches that represent distinct galaxy populations. This behavior
is controlled mainly by two properties, namely u - r and SFR (from blue young at low arc length to
red old at high arc length), while most of the other properties correlate well with these two. We further
present the variations of several important galaxy properties as a function of arc length. Luminosity
functions vary from steep Schechter fits at low arc length, to double power law and ending in
log-normal fits at high arc length. Galaxy clustering shows increasing autocorrelation power at large
scales as arc length increases. Cross correlation of galaxies with different arc lengths shows that the
probability of 2 galaxies belonging to the same halo decreases as their distance in arc length increases.

PCA allowed us to find peculiar galaxy populations located apart from the main cloud of data
points, such as small red galaxies dominated by a disk, of relatively high stellar mass-to-light ratio
and surface mass density. On the other hand, the P-curve helped us understand the average trends,
encoding 75% of the available information in the data.

The P-curve allows not only dimensionality reduction, but also provides supporting evidence for 
relevant physical models and scenarios in extragalactic astronomy: 

1) Evidence for the hierarchical merging scenario in the formation of a selected group of red massive 
galaxies. These galaxies present a log-normal r-band luminosity function, which might arise from 
multiplicative processes involved in this scenario. 

2) Connection between the onset of AGN activity and star formation quenching as mentioned in
Martin et al. (2007), which appears in green galaxies when transitioning from blue to red populations.
Subject headings: cosmology: large-scale structure of universe — galaxies: general — galaxies: lu- 
minosity function, mass function — galaxies: fundamental parameters — galaxies: 
absorption lines — galaxies: emission lines — galaxies: photometry — galaxies: 
statistics — methods: statistical — methods: data analysis 



1. INTRODUCTION 

In order to constrain the physical processes driving
galaxy evolution, it is common practice to measure a
number of physical properties for a set of galaxies, and
then investigate the correlations between these parameters.
In this context, galaxy surveys have become increasingly
valuable. The number of galaxies available is getting
larger, and the amount of information to constrain
physical properties is also increasing, yielding
more accurate estimates. The level of precision of these
estimates is also likely to increase in the future, either
with the combination of wide angle surveys observing
at different wavelengths, or with panchromatic surveys
using a large number of filters (e.g. PAU, Benitez et al.
2009), which will benefit from multiband imaging for
millions of galaxies. As this data deluge is turning
astronomy into a data-intensive or e-science (see
Hey et al. 2009), one is confronted with the issue of
being barely able to analyze the feature space, whose
dimensionality keeps on increasing. Faced with such a large
number of physical properties, one wants to find the
minimal and most important set which describes galaxies
accurately. In this context, a common approach used to
reduce the dimensionality of these datasets is to perform
a Principal Component Analysis (PCA, also known
as Karhunen-Loève transform; see e.g. Efstathiou & Fall
1984; Murtagh & Heck 1987). PCA enables us to find an
uncorrelated and orthonormal set of linear combinations
of properties (eigenvectors) that optimally describe the
correlations and variation of the data. This approach has
been fruitfully used in astronomy to classify galaxies and
quasars based on their spectra (Connolly et al. 1995;
Yip et al. 2004a,b). PCA has been applied on a wider
basis using various galaxy properties such as the equivalent
width of emission lines (Győry et al. 2011) or a mix of
spectral and morphological features (Coppa et al. 2011)
to help characterize the galaxy population. PCA also
proved useful, for instance, when applied to stellar
population synthesis models to derive galaxy physical
parameters (Chen et al. 2011). PCA does not, however,
capture all the information contained in the input sample.
It is by nature linear, and hence cannot describe
nonlinear correlations within the data. Other methods,
such as applying locally linear embedding to galaxy
spectra (Roweis & Saul 2000; Vanderplas & Connolly 2009),
take nonlinearities into account, as they map
high-dimensional data onto a surface while preserving the
local geometry of the data.

In this paper, we introduce the principal curve (P-curve,
see e.g. Einbeck et al. 2007 for a review), which
can be seen as a nonparametric extension of linear PCA.
The principal curve is the curve following the location
of the local mean in the multi-dimensional cloud of data
points. In practice, the P-curve can be conveniently built
in the PCA eigenspace spanned by the most important
eigenvectors along which the variance is highest. The
important fact here is that every data point can be as- 
signed a unique closest projection onto the curve, and 
be labeled by the arc length value measured from the 
beginning of the curve to the projection. This reduces 
the complexity of multi dimensional data effectively into 
only one dimension. Moreover, the ranking of galaxies 
according to their associated arc length values provides 
a natural and objective way of ordering, partitioning and 
classifying the rich zoo of galaxies in the nearby universe. 

In this paper, we take advantage of the wealth of data
and build the principal curve for both physical and
photometric properties of galaxies belonging to the low
redshift Main Galaxy Sample (MGS; Strauss et al. 2002) in
SDSS (Stoughton et al. 2002). Since the MGS is flux limited,
the Malmquist bias underestimates the volume density
of faint galaxies compared to that of brighter ones. As
a result, the common practice of performing a simple
PCA for all galaxies yields a result biased toward the
behavior of the properties of bright objects.
As a solution, we do not restrain the statistics by
constructing a much smaller volume limited sample, but
keep all galaxies by assigning them weights with which
we perform Weighted PCA (WPCA) and P-curve methods.
We then investigate how the arc length associated
with each galaxy correlates with a number of photometric,
spectroscopic and physical galaxy properties, as well as
morphology, mean spectra, and the first (luminosity
function) and second (clustering) moments of galaxies. Our
results show that the arc length values remarkably
encode a large number of well-known trends in the local
Universe.

This paper is organized as follows: in Section 2 we
present the dataset we use. Section 3 details the galaxy
properties we include in our PCA. Section 4
presents the methods we use for the dimensionality
reduction, weighted PCA and principal curve. We detail in
Sec. 5 how we build the principal curve from the SDSS
data. In Section 6 we present our results and discuss
them in Sec. 7.

We use in this paper a flat $\Lambda$ cosmology assuming

$\{\Omega_\Lambda, \Omega_M, h, w\} = \{0.7, 0.3, 0.7, -1\}$.

2. THE GALAXY SAMPLE 

In this paper we use photometric and spectroscopic
data of galaxies from SDSS-DR8 (Aihara et al. 2011),
available in a MS-SQL Server database queried online
via CasJobs.

In particular, we use the Main Galaxy Sample (MGS)
(Strauss et al. 2002). These galaxies constitute a flux
limited sample, with an r-band Petrosian apparent
magnitude cut of $m_r < 17.77$, and a redshift distribution
peaking at z ~ 0.1. Their spectra cover the rest
frame range of 3800-8000 Å, with a resolution of
69 km s$^{-1}$ pix$^{-1}$.

Several selection cuts and flags were enforced in order
to have a clean sample. We selected only science
primary objects appearing in calibrated images having
the photometric status flag. Also, we selected imaging
fields where 0.6 < score < 1.0, which assures
good imaging quality with respect to the sky flux and
the PSF's width. Furthermore, we excluded individual
objects with bad deblending (with flags PEAKCENTER,
DEBLEND_NOPEAK, NOTCHECKED), interpolation problems
(PSF_FLUX_INTERP, BAD_COUNTS_ERROR) or suspicious
detections (SATURATED, NOPROFILE). Also, we
chose galaxies whose spectral line measurements and
properties are labeled as RELIABLE.

The sky footprint of the clean spectroscopic survey
builds up from a complicated geometry defined by
sectors, whose aggregated area covers ~7930 deg$^2$, or
a fractional area $F_A = 0.192$ of the whole sky. We choose
a redshift window of $[z_1, z_2] = [0.02, 0.08]$. The lower
limit avoids including large photometrically-cumbersome
galaxies on the sky, and the upper limit reduces the
amount of evolution of galaxy properties ($\Delta t < 0.78$ Gyr),
while keeping the statistics high. Redshift incompleteness
arises from the fact that two 3" aperture spectroscopic
fibers cannot be put closer together than 55" in the
same plate. As a strategy, denser regions in the sky are
given a greater number of overlapping plates. Nevertheless,
7% of the initial galaxies photometrically targeted
as MGS did not have their spectra taken.

We further construct a magnitude limited sample,
on which we will center our main study. Here,
extinction-corrected Petrosian apparent magnitude cuts
of $[m_1, m_2] = [13.5, 17.65]$ are applied. The lower
limit is set due to the cross talk arising between close fibers
in the spectrographs when they contain light from very
bright galaxies. The upper limit safely avoids the slight
variations of the limiting apparent magnitude around
17.77 over the sky in the targeting algorithm. This leaves
us with 174,266 galaxies.

A volume limited subsample was also created,
being a subset of the previous magnitude limited sample.
This subsample is used for the study of spatial
correlation functions in Sec. 6.4. The redshift range is
$[z_1, z_2] = [0.02, 0.05]$, with an absolute magnitude
window of $[M_{r,1}, M_{r,2}] = [-21.19, -19.08]$, which leaves us
with ~40,000 galaxies.

3. SELECTING GALAXY PROPERTIES 

Galaxies present a variety of physical, spectroscopic
and 5-band photometric properties made available in the
SDSS-DR8 data catalog. We selected the most relevant ones
in order to create a p-dimensional cloud of properties or
features for further study.

Among the photometry-derived properties included are
the colors, which show the coarse shape of galaxy spectra,
and to some extent the age of the overall stellar
population in the galaxy. Only the colors u - r and g - r were
selected, since most of the color combinations possible from
the u, g, r, i, z bands (Fukugita et al. 1996) are highly
correlated. For computing colors, extinction-corrected
model magnitudes (Stoughton et al. 2002) were used, as
well as k-corrections to an observing rest-frame of z = 0.
The k-corrections are calculated by using a template fitting
technique used in e.g. Budavári et al. (2000) and
Csabai et al. (2000). Here, the colors are matched to
the colors of a model spectrum defined by a non-negative
linear combination of redshifted template spectra.
Then, the best model spectrum is blueshifted back to
the rest frame (z = 0) and the k-correction computed.
The template spectra are drawn from a list provided by
Bruzual & Charlot (2003).

Since we will study the luminosity function as a function
of position in this cloud (Section 6.3), we decided
not to include the absolute magnitude $M_r$ as a property.
If we did, any partitioning of the cloud would introduce
undesired artificial cuts in the range of absolute
magnitudes used in the computation of luminosity functions.
Therefore, neither the absolute magnitude nor any other
property strongly correlated with it (such as stellar mass)
should be chosen as part of the properties.

Another photometry-derived feature is the concentration
index $C = R90_r/R50_r$, where $R90_r$ and $R50_r$ are
the radii enclosing 90% and 50% of the r-band Petrosian
flux, respectively. This index has been found to
correlate with galaxy morphological type (Strateva et al.
2001; Shimasaku et al. 2001). Indeed, de Vaucouleurs
light profiles of elliptical galaxies are more concentrated
than the exponential profile in the disks of spiral galaxies.

The redshift-dependent r-band surface brightness defined
by $\mu_{50,r} = m_r + 2.5 \log[2\pi R50_r^2 (1 + z)^4]$ is also
included as a property. This breaks the degeneracy
of $R90_r/R50_r$ between bright and dim spiral galaxies.
Here we use the extinction and k-corrected Petrosian
apparent magnitude $m_r$, taking $\sqrt{2}\,R50_r$ as a less noisy
proxy for the Petrosian radius (Stoughton et al. 2002;
Strauss et al. 2002).

The physical properties selected are the star formation
rate (SFR), specific star formation rate (SFR/$M_*$,
where $M_*$ is the stellar mass) and Petrosian r-band
mass-to-light ratio ($M_*/L_r$). These are included in
SDSS-DR8 and obtained from galaxy spectra analysis
at MPA and JHU, as detailed in Kauffmann et al.
(2003a), Brinchmann et al. (2004), Tremonti et al.
(2004), Gallazzi et al. (2005) and Salim et al. (2007).



Note that $M_*$ has been derived from template fitting
to the total flux in the 5 photometric bands
(Aihara et al. 2011). As the spectroscopic fibers
cover only the central 3" of each galaxy, the SFR
had to be corrected for this deficiency to its full value
(Brinchmann et al. 2004).

Since other spectral features such as Lick indices or
line equivalent widths are non-trivial to extrapolate
from their fiber values to the full galaxy ones, we do not
include these in the building of the cloud of properties.
We do, however, study them separately in Sections 6.2 and 7.

4. METHODS FOR DIMENSIONALITY REDUCTION

Most of the time, data mining deals with the data
point matrix $A = [A_1 A_2 \dots A_p] \in \mathbb{R}^{N \times p}$, composed of
the columns $\{A_i\}_{i=1}^{p}$ that contain N observations for
each of the p properties or features. Thus A can be
thought of as a length-N realization of the random vector
$X = [X_1 \dots X_p] \in \mathbb{R}^p$ with distribution $\mathcal{D}_X(x)$.

In our work, dimensionality reduction is used for
explaining the variations of X as a function of only 1
parameter. To that end, we use Weighted PCA and Principal
Curves, whose detailed descriptions are included in
Sections 4.1 and 4.2.



4.1. Weighted PCA (WPCA) 

Principal Components Analysis (PCA) (Pearson 1901;
Jackson 1991; Jolliffe 2002), also known as Karhunen-Loève
Transform, is a widely used method for dimensionality
reduction and classification. It can be seen as
a transformation involving a translation, linear scaling
and rigid rotation of a collection of N p-dimensional data
points onto a new coordinate system. The new orthonormal
axes, or principal components $\{PC_i\}_{i=1}^{p} \in \mathbb{R}^{N \times 1}$,
are constructed such that the projections of the data
points on the $PC_i$s are uncorrelated. $PC_1$ is selected
as the axis in $\mathbb{R}^p$ which has the highest possible variance
of the points projected onto it. The next $PC_i$s are
ordered in descending value of the variance, with $PC_p$
having the lowest. Thus, dimensionality reduction is attained
by describing the data in terms of the most important
principal components (Hastie et al. 2009). This can be
obtained by considering only the space spanned by the
first q < p variance-ranked eigenvectors whose cumulative
variance reaches above a high enough threshold.

In practice, the PCs and their variances can be found
using singular value decomposition (SVD) of the covariance
matrix C of the data points (Golub & Van Loan
1996). SVD allows us to factorize it in the form
$C = V \Sigma V^T$. Here, $\Sigma$ is a diagonal matrix with the
eigenvalues (variances) and V contains the eigenvectors
(principal components) in the respective columns. Thus, V
contains the expansion coefficients of the transformation
$PC_i = \sum_{j=1}^{p} V_{ji} X_j$ ($i = 1, \dots, p$) from property space to
PC space.

In Weighted PCA, the covariance matrix is calculated
in a weighted scheme. Many times we are confronted
with noisy or missing data points. As a solution, we can
assign a weight $w_i > 0$ to each ith data point in order to
account for the noisy data points or the missing ones. In
this context, WPCA involves considering these weights
in the calculation of all averages and covariances between
the p properties. In general, the properties might have
different units, so they first have to be made unitless
by standardization of the data points (subtract from
each property its (weighted) average and then divide it
by its (weighted) standard deviation).
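
The WPCA steps described above (weighted standardization, weighted covariance, SVD of the covariance matrix) can be sketched in a few lines of NumPy. This is only an illustrative sketch: the function name `weighted_pca` and its arguments are hypothetical, and the weights are assumed to be normalized to unit sum before computing the averages.

```python
import numpy as np

def weighted_pca(data, weights):
    """Illustrative WPCA. data: (N, p) array, weights: (N,) positive weights.

    Returns the variances (descending), the matrix V whose columns are the
    principal components, and the projections of the data in PC space.
    """
    x = np.asarray(data, dtype=float)
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()                        # assumption: weights normalized to sum 1

    mean = w @ x                           # weighted average of each property
    std = np.sqrt(w @ (x - mean) ** 2)     # weighted standard deviation
    z = (x - mean) / std                   # standardized (unitless) properties

    cov = (z * w[:, None]).T @ z           # weighted covariance matrix C (p x p)
    v, variances, _ = np.linalg.svd(cov)   # C = V Sigma V^T for a symmetric C
    scores = z @ v                         # PC_i = sum_j V_ji X_j
    return variances, v, scores
```

Because the properties are standardized, the returned variances add up to p, so the cumulative fraction `np.cumsum(variances) / variances.sum()` can be used to choose the number q of components to keep.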

4.2. Principal Curves 

Principal curves (P-curves) and surfaces (P-surfaces)
(Hastie 1984; Hastie & Stuetzle 1989; Tibshirani 1992;
Gorban et al. 2008) go one step beyond PCA, providing
a low-dimensional curved manifold that passes through
the middle of the data points. In this paper we consider
a 1-parameter (called $l$) principal curve
$f(l) = [f_1(l), \dots, f_p(l)] \in \mathbb{R}^{N \times 1}$, where each of the N data points
$x = [x_1 \dots x_p]$ is given a unique closest projection $f(l_f(x))$
onto the curve. As a convention, $l_f(x)$ is chosen to be
the arc length from the beginning of the curve to the
projection point of x. In this context, the P-curve can
be considered by itself the 1st and only curved principal
component, as the dimensionality of the data is reduced
from p to 1 dimension. In practice, the P-curve is
composed of N - 1 line segments that connect the projection
points.

The principal curve is defined as the average of the
data points that project onto it, minimizing the projection
distance between x and $f(l_f(x))$ over all points.
This property of self consistency allows us to follow
a series of iterative projection-expectation steps for its
construction (Hastie & Stuetzle 1989). In fact, an educated
first guess for the P-curve is to make it equal to
$PC_1$. Later on, the jth estimate $f^{(j)}(l)$ of the curve
at the jth expectation step is calculated as
$f_i^{(j)}(l) = E(X_i \mid l_{f^{(j-1)}}(X) = l)$. In practice, we compute this
expression using a weighted penalized cubic B-spline
regression (Silverman 1985; Hastie & Tibshirani 1990;
Ruppert et al. 2003; Hastie et al. 2009). These splines
are calculated on a series of k knots chosen from the
data points, while the degrees of freedom (df) of the
regression control the degree of smoothing of the P-curve.
The jth projection step is performed
next, involving the search for the closest perpendicular
projection of x onto $f^{(j)}(l)$, which is composed of the
N - 1 line segments. The iterations stop when the
cumulative projection distance from the data points to the
P-curve does not change significantly with respect to the
one in the previous step.
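
The projection step can be illustrated with a short NumPy sketch that finds, for every data point, the closest point on a curve made of line segments, and returns the arc length of that projection and the projection distance. The function name `project_onto_polyline` is hypothetical, consecutive vertices are assumed to be distinct, and the brute-force search over all segments is a simplification chosen for clarity (its memory use grows as the number of points times the number of segments), not necessarily the search strategy used for the results in this paper.

```python
import numpy as np

def project_onto_polyline(points, vertices):
    """Closest projection of each point onto a polyline (illustrative).

    points:   (N, q) data points
    vertices: (M, q) ordered vertices of the curve (distinct neighbors assumed)
    Returns (arc_length, distance), each of shape (N,).
    """
    points = np.asarray(points, dtype=float)
    vertices = np.asarray(vertices, dtype=float)
    seg_start = vertices[:-1]
    seg_vec = vertices[1:] - vertices[:-1]               # (M-1, q)
    seg_len = np.linalg.norm(seg_vec, axis=1)
    # arc length accumulated at the start of every segment
    start_arc = np.concatenate(([0.0], np.cumsum(seg_len)[:-1]))

    # fractional position t of the perpendicular foot on each segment,
    # clipped to [0, 1] so the projection stays inside the segment
    rel = points[:, None, :] - seg_start[None, :, :]     # (N, M-1, q)
    t = np.einsum("nmq,mq->nm", rel, seg_vec) / seg_len ** 2
    t = np.clip(t, 0.0, 1.0)

    foot = seg_start[None, :, :] + t[:, :, None] * seg_vec[None, :, :]
    dist = np.linalg.norm(points[:, None, :] - foot, axis=2)

    best = np.argmin(dist, axis=1)                       # closest segment
    rows = np.arange(len(points))
    arc_length = start_arc[best] + t[rows, best] * seg_len[best]
    return arc_length, dist[rows, best]
```

Summing the squared returned distances (with the galaxy weights) gives the cumulative projection distance that can be monitored as a stopping criterion between iterations.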

Although P-curves are constructed on the p-dimensional
space of properties, we can consider building
the P-curve of the data points projected on the first
q most important principal components of the WPCA.
This minimizes the complexity and computations,
especially in the case $p \gg q$, without losing much
information. The approximation is of course valid as long as the
first q eigenmodes contain as much of the total variance
as possible.

5. BUILDING THE PRINCIPAL CURVE AND POPULATION 
SEPARATORS ALONG ARC LENGTH 

5.1. $V_{max}$ weighting

As the MGS is a magnitude limited sample, not all 
galaxy types are sampled equally in the survey volume. 
As a consequence, we used WPCA and a weighted prin- 
cipal curve of the galaxy population to get an unbiased 
result. 

In detail, at higher redshifts we sample mostly the
brightest galaxies, neglecting the faint ones (Malmquist
bias). On the other hand, at low redshifts the SDSS
spectrograph fails to take the spectra of very bright galaxies
(see Section 2).

As a solution, we use the $V_{max}$ weighting method
(Schmidt 1968) to account for this incompleteness. Here,
each i-th galaxy is assigned a weight $w_i = V_S/V_{max,i} \geq 1$,
where $V_S$ is the volume of the survey. Here we note that,
given the particular $[z_1, z_2]$ and $[m_1, m_2]$ intervals for the
survey, the i-th galaxy found at $z_i$ could be observed
only within a maximum comoving volume $V_{max,i} \leq V_S$.



If the i-th galaxy of apparent magnitude $m_i$, k-correction
$k_i = k(z_i)$, and at a luminosity distance $D_L(z_i)$ were to
have limiting apparent magnitudes $m_{1,2}$, then it should
be moved to a limiting luminosity distance $D_{L,i}(z_{lim}; m_{1,2})$
given by

$D_{L,i}(z_{lim}; m_{1,2}) = D_L(z_i) \times 10^{(m_{1,2} - k(z_{lim}) - m_i + k_i)/5}$.   (1)

Hence, the maximum volume is defined by the biggest
interval of $D_L$ inside which a galaxy can appear in the
survey:

$V_{max,i} = [V(\min(D_L(z_2), D_{L,i}(z_{lim}; m_2))) - V(\max(D_L(z_1), D_{L,i}(z_{lim}; m_1)))] \times F_A$.   (2)

As Eq. (1) defines $z_{lim}$ in an implicit way, we solve for it
iteratively. We calculated the $V_{max}$ values directly inside
the database using an integrated cosmological functions
library (Taghizadeh-Popp 2010).
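
As an illustration of Eqs. (1) and (2), the following NumPy sketch computes the weight $V_S/V_{max,i}$ for a single galaxy. It is not the database library mentioned above: the function names, the tabulated redshift grid, the fixed number of iterations and the default k-correction (identically zero) are all assumptions made for the example. Distances follow the flat cosmology adopted in this paper ($\Omega_M = 0.3$, $\Omega_\Lambda = 0.7$, h = 0.7).

```python
import numpy as np

C_KM_S, H0 = 299792.458, 70.0            # speed of light [km/s], H0 [km/s/Mpc]
OMEGA_M, OMEGA_L = 0.3, 0.7

# Tabulate comoving and luminosity distances (Mpc) on a redshift grid.
_Z = np.linspace(0.0, 1.0, 20001)
_INV_E = 1.0 / np.sqrt(OMEGA_M * (1.0 + _Z) ** 3 + OMEGA_L)
_STEPS = 0.5 * (_INV_E[1:] + _INV_E[:-1]) * np.diff(_Z)      # trapezoid rule
_DC = (C_KM_S / H0) * np.concatenate(([0.0], np.cumsum(_STEPS)))
_DL = (1.0 + _Z) * _DC

def lum_dist(z):
    return np.interp(z, _Z, _DL)

def z_of_lum_dist(d_l):
    return np.interp(d_l, _DL, _Z)       # D_L increases monotonically with z

def volume(d_l):
    """Full-sky comoving volume (Mpc^3) enclosed within luminosity distance d_l."""
    d_c = np.interp(z_of_lum_dist(d_l), _Z, _DC)
    return 4.0 / 3.0 * np.pi * d_c ** 3

def limiting_dl(z_i, m_i, m_lim, kcorr, n_iter=50):
    """Eq. (1), solved for z_lim by fixed-point iteration."""
    z_lim, k_i = z_i, kcorr(z_i)
    for _ in range(n_iter):
        d_lim = lum_dist(z_i) * 10.0 ** ((m_lim - kcorr(z_lim) - m_i + k_i) / 5.0)
        z_lim = z_of_lum_dist(d_lim)
    return d_lim

def vmax_weight(z_i, m_i, z1=0.02, z2=0.08, m1=13.5, m2=17.65, f_a=0.192,
                kcorr=lambda z: 0.0 * z):    # assumption: no k-correction
    """Weight w_i = V_S / V_max,i following Eq. (2)."""
    d_hi = min(lum_dist(z2), limiting_dl(z_i, m_i, m2, kcorr))
    d_lo = max(lum_dist(z1), limiting_dl(z_i, m_i, m1, kcorr))
    v_max = (volume(d_hi) - volume(d_lo)) * f_a
    v_s = (volume(lum_dist(z2)) - volume(lum_dist(z1))) * f_a
    return v_s / v_max
```

A galaxy that stays within the magnitude limits over the whole redshift window gets a weight of 1, while a galaxy close to the faint limit gets a weight larger than 1. A realistic k-correction can be supplied through the `kcorr` argument.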

The PCA, P-curve and calculations related to volume
densities in this paper (such as histograms) are all $V_{max}$
weighted.

5.2. WPCA results 

To avoid skewing the PCA, we visually clipped
the outliers in each of the p = 7 galaxy properties
in order to dismiss artifacts or wrong measurements. We
also used only galaxies which have all the 7 properties
measured, simplifying the calculations and avoiding the
use of Gappy PCA (Connolly & Szalay 1999). This left us
with a total of N = 171,698 galaxies (99% of the initial
ones).
Fig. 1.-- The V matrix resulting from applying WPCA on the 7
galaxy properties. The columns are the orthonormal principal
components (i.e. eigenvectors of the covariance matrix). Each $PC_i$,
i = 1,...,7 can be viewed as a linear combination of properties,
with the expansion coefficients $V_{ji}$ of the jth property stored in
the jth row. Coefficients with stronger color show a higher
importance of the property for the given PC. The sign of the coefficient
shows correlations/anticorrelations between the properties and the
PC.

Figure 1 and Table 1 present the results from computing
WPCA on the 7 galaxy properties. From Table 1
we can notice that most of the information (97% of
the total variance) is contained in the first 4 principal
components. In Fig. 1, each $PC_i$, i = 1,...,7 can be
viewed as a linear combination of properties, with the
expansion coefficients $V_{ji}$ of the jth property stored in the
jth row. Coefficients with stronger color show a higher
importance of the property for the given PC. The sign
of the coefficient shows correlations/anticorrelations
between the properties and the final value of the PC.

TABLE 1

Variance for each principal component and its associated
cumulative variance. Since the data has been standardized, the
sum of the variances is equal to the number of dimensions
(p = 7).

For PC1, the strength (absolute value) of its
expansion coefficients $V_{ji}$ in the basis of the galaxy
properties is shared mostly evenly between these properties,
with u - r, g - r, SFR/$M_*$ and $M_*/L_r$ being the most
important. The correlations show that high values of u - r,
g - r, $M_*/L_r$ and $R90_r/R50_r$, together with low
values of $\mu_{50,r}$, SFR/$M_*$ and SFR, will produce a high
PC1 value. We might therefore expect that PC1 is a
good separator between the young, blue population of
spirals/irregulars and the old population of red
ellipticals.

For PC2, the SFR and $\mu_{50,r}$ are the most important,
having opposite signs. Thus, we expect galaxies with
high surface brightness and high star formation to show
high values of PC2.

For PC3, the most important property is $R90_r/R50_r$,
with an opposite correlation with respect to the next
important properties of mostly equal strength (u - r,
g - r, SFR, $M_*/L_r$ and $\mu_{50,r}$). We can expect that
big and bright star-forming spiral galaxies with reddish
colors (probably from a red core) should have high PC3.

For PC4, all the properties have the same correlations,
with $\mu_{50,r}$, $R90_r/R50_r$ and SFR being the most important.
Thus, concentrated (and possibly star-forming) galaxies
of faint surface brightness have high values of PC4. As
the variance along PC4 is much smaller than that of the
previous PCs, this is a rare combination of correlations for
these properties to be observed at the same time.

Furthermore, the last 3 PCs (PC5, PC6 and PC7),
which account for less than 2% of the total variance,
are less obvious to interpret. They might trace either
special cases of galaxy populations, or just artifacts and
wrong/noisy measurements of the properties.

Fig. 2.-- Principal curve (black continuous line) fitted to the first 4 principal components (density maps are log-scaled, with contour
curves separated by 0.5 dex). The arc length increases in the direction of increasing PC1. The first and last 15 data points (ordered by
arc length) are connected to their corresponding projections on the curve with dashed lines. The separators between the $\{L_i\}_{i=1}^{20}$ groups
are shown as black circles on top of the curve (see text). The curve presents 3 turning points marked as brown rings. The colored arrows
show the directions and relative strength of the galaxy properties projected onto PC space. In PC3 vs. PC4, we plotted the separating
line between the main cloud and a small blob of galaxies, given by PC4 < -1.3 + 0.55(PC3 - 2.0) (see Sec. 6.5.1).



5.3. The fitted Principal Curve and Population 
Separators along it 

We decided to construct the principal curve in the
4-dimensional space defined by {PC1, ..., PC4}, since their
combined cumulative variance (0.973) is close to unity
(see Table 1). Although the computation for the number
of dimensions and data points involved is not too
intensive, we think of this as a pedagogical example that can
be used for other extreme cases with $N > 10^{10}$ objects
and p > 100 dimensions, for instance. In fact, our choice
does not significantly change the results compared
to using p = 7.

In the expectation step for creating the principal curve,
each $PC_i$ is fitted with penalized B-splines of 5.4 degrees
of freedom (df), defined at a sequence of k = 211 unique
knots chosen at equally spaced quantiles of arc-length
values. Principal curves with df > 7 oscillate
excessively, turning back and forth across and
along PC1, whereas those with df ~ 4 resemble more closely
a straight line along the PC1 direction.

Figure 2 shows the result of fitting the principal curve
to {PC1, ..., PC4}. The 4-dimensional cloud of
properties presents 2 density maxima placed mainly along
the PC1 direction, corresponding to the blue and red
galaxy populations. The principal curve mostly
resembles the letter "W", clearly presenting 4 different regimes
or branches separated by 3 turning points (T-points).

We created 20 equal number density galaxy groups (in
Mpc$^{-3}$) labeled as $\{L_i\}_{i=1}^{20}$ by placing population
separators at fixed arc length values along the P-curve, as
shown in Figures 2 and 3. Galaxies are grouped together
into the same $L_i$ group when the arc length values
measured at their projection points onto the P-curve are
placed between 2 consecutive separators. These
separators are positioned in such a way that the ($V_{max}$-weighted)
number density (in Mpc$^{-3}$) of the galaxies
belonging to each of the 20 $L_i$ groups amounts to 1/20th
of that from the whole sample of galaxies. This allowed
us to study the 4 principal curve branches in detail. We
chose the arc-length to increase in the same direction
of increasing PC1, with growing values of arc length as
we progress from $L_1$ to $L_{20}$. Thus, the P-curve's 1st
branch is comprised in $\{L_1, ..., L_6\}$, the 2nd branch
in $\{L_7, ..., L_{14}\}$, the 3rd branch in $\{L_{15}, ..., L_{18}\}$ and the
4th branch in $\{L_{19}, L_{20}\}$. Table 2 shows some statistics
of these groups.

Fig. 3.-- Probability density of the arc-length values $\{l_i\}_{i=1}^{N}$, measured at the points of the curve where the N data points are projected
onto. The arc-length increases in the direction of increasing PC1. Vertical black continuous lines denote the population separators, while
the numbers denote the $\{L_i\}_{i=1}^{20}$ galaxy groups. The 5 small tick marks within each galaxy group mark the boundaries of the subgroups
$\{\Lambda_1, ..., \Lambda_{100}\}$.

Fig. 4.-- Density maps of the principal components (y-axis) as
a function of the arc-length $l$ (x-axis). Density is log-scaled, with
contour curves separated by 0.5 dex. Population separators are shown
as vertical tick marks. The numbers denote the $\{L_i\}_{i=1}^{20}$ galaxy
groups. Colored vertical lines show the position of the maxima and
turning points at particular $l$ values.

Within each $L_i$ group, we further created 5 subgroups
of galaxies along arc length, naming them $\{\Lambda_i\}$, also
of equal number density in Mpc$^{-3}$ as explained before.
We further partitioned these groups similarly, now using
several radial separators in the perpendicular direction
to the curve, defining 10 concentric cylinder-like
separating surfaces. In this way, the groups defined by this
finer partitioning all have the same number density (in
Mpc$^{-3}$), equal approximately to 1/1000th of the number
density of the whole sample. This allowed us to identify
and extract localized galaxy populations positioned very
close to the spine of the cloud of properties, and study
them in Sec. 6.5.2.
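
Placing separators so that every group holds the same weighted number density amounts to computing weighted quantiles of the arc length. A minimal NumPy sketch follows; the function names are hypothetical, and the weights are assumed to be the $V_{max}$ weights of Section 5.1.

```python
import numpy as np

def equal_density_separators(arc_length, weights, n_groups=20):
    """Arc-length values splitting the sample into n_groups of equal total weight."""
    arc_length = np.asarray(arc_length, dtype=float)
    order = np.argsort(arc_length)
    l_sorted = arc_length[order]
    cum_w = np.cumsum(np.asarray(weights)[order], dtype=float)
    cum_w = cum_w / cum_w[-1]                       # cumulative weight fraction
    targets = np.arange(1, n_groups) / n_groups     # 1/n, 2/n, ..., (n-1)/n
    idx = np.searchsorted(cum_w, targets)           # first galaxy reaching each target
    return l_sorted[idx]

def assign_groups(arc_length, separators):
    """Group label (1..n_groups) for each arc-length value."""
    # a galaxy lying exactly on a separator is assigned to the lower group
    return np.searchsorted(separators, arc_length, side="left") + 1
```

The same two functions can be reapplied inside each group (for example with `n_groups=5`) to obtain the finer subgroups along arc length. With discrete galaxies the groups are equal in weight only up to the weight of a single object.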

Figure 3 shows the probability density distribution of
the arc-length $l$ values, as well as the population
separators. The curve has a length of $l_{max} = 20.24$, and the
variance of the arc length values is $\sigma_l^2 = 7.79$, measured
with respect to the center of the curve at $\langle l \rangle = 7.91$.
Note that Table 2 shows that the quadratic mean (root
mean square) of all the projection distances from the data
points to the P-curve takes a value of $d_\perp = 1.31$, which
is small compared to the length of the curve. The blue
and red peaks of maximum density are clearly visible,
as well as a small green peak. The 1st turning point (at
$L_6$) lies close to the blue maximum ($L_7$), whereas the
red maximum ($L_{17}$) is a little behind the 3rd T-point
($L_{18}$), after which we can find a hump defining the red
sequence of galaxies. We find a green maximum ($L_{16}$)
standing in between the 2nd T-point ($L_{14}$) and the red
maximum.

Figure [4] shows the density maps of the scatter of each
of {PC1, ..., PC4} as a function of the arc length. The
different shapes that this scatter presents depend evidently
on the contortions or twists of the principal curve along
the PCs. As the 4 branches of the curve mostly turn
left and right along PC2, the scatter in PC2 shows the
same "W" shape as the P-curve. On the other hand,
the curve increases its length along the PC1 direction, so
the scatter shows a mostly linear relation between PC1
and arc length. The same analysis applies to the scatter
of the next PCs, which is boomerang-shaped for PC3
and mostly constant with respect to arc length for PC4
(although with small wiggles).



TABLE 2

Statistics of the galaxy groups.

Group   N_gal(a)   l_min   l_max   <l>     d⊥
L1        4050             4.12    3.54    1.57
L2        2987     4.12    4.67    4.41    1.34
L3        2368     4.67    5.09    4.87    1.25
L4        2136     5.09    5.54    5.32    1.14
L5        1589     5.54    5.88    5.72    1.12
L6        1345     5.88    6.11    6.01    1.17
L7        1674     6.11    6.31    6.21    1.27
L8        2190     6.31    6.55    6.44    1.29
L9        3568     6.55    6.87    6.71    1.12
L10       6287     6.87    7.25    7.06    1.16
L11      10196     7.25    7.67    7.46    1.21
L12      13862     7.67    8.12    7.89    1.31
L13      17062     8.12    8.63    8.37    1.39
L14      18283     8.63    9.24    8.93    1.51
L15      15287     9.24    9.99    9.6     1.56
L16       9421     9.99   10.82   10.39    1.50
L17       5546    10.82   11.45   11.16    1.51
L18       9610    11.45   12.21   11.82    1.34
L19      19877    12.21   13.09   12.64    1.12
L20      24360    13.09   20.24   13.69    1.11
All     171698            20.24    7.91    1.31

(a) N_gal denotes the number of galaxies in each group, contained
in the arc length interval [l_min, l_max] with average arc length <l>.
The value d⊥ denotes the quadratic mean (root
mean square) of the projection distances from the data points
onto the P-curve.



6. GALAXY PROPERTIES AND STATISTICS AS 
FUNCTION OF ARC LENGTH 

In this section we show how galaxy properties, luminosity
functions and spatial clustering change as a function
of the {Li} (i = 1, ..., 20) equal number density galaxy groups
(ordered in ascending arc length).

Compared to PC1 alone, the principal curve provides
much more information about particular changes in properties
along its arc length. We will see that the evolution
of galaxy properties along the curve is intimately related
to the "W" shape of the principal curve, where each of
the 4 branches defines a particular galaxy population.

6.1. Morphology and Average Spectra 

Figure [5] shows the most representative galaxy morphologies
and average spectra for the {Li} (i = 1, ..., 20) groups.

The most evident feature is the change in color and
the slope of the spectra (from blue to red), as well as an
overall weakening of emission lines (e.g. the Balmer series
of hydrogen and forbidden lines, such as OIII, OII, NII,
etc.) and an increase of metallic absorption lines and
bands (Na, Mg, H, K, G) as we reach high arc length values.
In the same way, morphological types include various
types of blue galaxies at the beginning and middle of the
curve, whereas red ellipticals dominate the end of it. This
bimodality is expected and agrees with PC1 in Fig. [1],
appearing also in other studies as the change along the 1st
principal component (e.g. Yip et al. 2004b; Coppa et al.
2011). We can, however, also identify more subtle
populations along arc length, not distinguishable in PC1
alone. These distinct populations are defined on each of
the 4 branches of the principal curve, connected by the
3 turning points.

With respect to morphologies, we see that the arc
length correlates very well with the Hubble galaxy type.
We however miss the distinction between barred and
non-barred spiral galaxies, due to the lack of properties able to
separate them. Blue irregulars and blue compact dwarf
(BCD) galaxies (Papaderos et al. 2006; Corbin et al.)
appear in the 1st branch of the principal curve.
Some of these types of BCDs were identified as the green
pea galaxies at higher redshift (Cardamone et al. 2009).
These morphologies then change into low surface brightness
galaxies (LSBGs) with spiral and irregular shapes,
which dominate the 1st turning point and blue maximum.
Bright spirals with strong blue star forming arms
appear in the second branch, and by the 2nd turning
point show sizable bulges. A dramatic change happens
in the 3rd branch, where reddish big-bulged spirals and
lenticulars dominate, forming part of the green and red
maxima. A new transition happens at the 3rd turning
point, with the big bright red ellipticals (cDs) and
brightest cluster galaxies (BCGs) dominating at the end of
the P-curve's 4th branch.

Emission lines, such as the forbidden OII, OIII, SII
and NeII, as well as the Balmer series of hydrogen (e.g.
Hα, Hβ, Hγ), are strong in the violently star forming blue
galaxies in the 1st branch. These lines weaken as we
transition into LSBGs, but interestingly Hα and Hβ become
stronger in the 2nd branch, reaching maximal values
in the star forming spirals at the 2nd turning point.
After this, they weaken again to become imperceptible
in the bright ellipticals in the 4th branch. NII follows
the same pattern as Hα, but somehow remains visible
in cD galaxies, as seen in many spectral atlases (e.g.
Dobos et al. 2012). On the other hand, OIII declines
steadily with arc length, disappearing after the red
maximum.

Absorption lines, such as Na, Mg and the G band, become
evident in the star forming spirals by the end of the
2nd branch (as the bulge increases in size), and appear
strong in the ellipticals in the 4th branch. Although
the H and K lines of calcium are always visible, the
4000 Å break increases steadily with arc length, turning
into a striking feature in bright ellipticals.

6.2. Spectral and Physical Properties 

Figure [6] shows the evolution of the galaxy properties
as a function of arc length in the principal curve, whereas
Table [3] contains the average values of the properties at
each {Li} (i = 1, ..., 20).

Looking at the 7 properties on which the WPCA was
built, the most important feature is that they present the
same shape as the PC in which they have the greatest
Fig. 6. Galaxy properties (y-axis) as a function of arc length in the principal curve (x-axis). Properties are ordered row-wise, and
grouped according to which principal component they resemble the most, as in Fig. [4] (upper right hand corner of each panel). The PC1
case resembles a straight line, PC2 a "W" and PC3 a boomerang. The black circles at the mean arc length value within each {Li} (i = 1, ..., 20)
group show the position of the median of the distribution of the property in it, together with vertical bars spanning the 15.9% to 84.1%
quantiles (±1σ). The orange and cyan bars show respectively the same quantiles for the red spine and red spiral blob galaxies discussed in
Sec. 6.5.
TABLE 3

Medians of galaxy property distributions in each galaxy group, together with the 15.9% to 84.1% quantiles (±1σ).

Columns: Group; log M*/L_r (M⊙ L⊙^-1); log SFR/M* (yr^-1); u - r; g - r; r - i; g - z; D_n(4000); Lick G4300 (Å); Lick Fe4531 (Å); Lick Mg2; Lick Na D (Å); Lick HδA (Å).

[Table entries not recoverable from the source text.]

The blob and red spine groups are detailed in Sec. 6.5.

leverage. On the other hand, the properties not present
in the WPCA show similar shapes or behaviors, depending
on their individual correlations with the initial 7 properties.
Generally, their behavior (as a function of arc
length) is defined by the 4 branches and 3 turning points,
resembling in most cases a distorted version of the "W" of the P-curve.

Thus, we can group all the properties according to
which PC they resemble the most. For example, Fig. [1]
shows that log M*/L_r, log SFR/M*, u - r and g - r have
their greatest leverage in PC1 among the first 4 PCs. This
makes these properties resemble the shape of PC1
in Fig. [4], which is mostly a linear relation with respect
to arc length, with a scatter modulated by the turning
points. Other properties that correlate with PC1 are
r - i, g - z, D_n(4000), some Lick indices and [OIII]. Note
that [OIII] has a strong linear dependence on PC1 and
behaves differently from the other emission lines (such as
the Balmer series) due to its higher ionization degree. The
arc length also correlates linearly with eclass, which is
the classification parameter derived in the PCA of galaxy
spectra in Yip et al. (2004b), defined as a function of the
expansion coefficients in the basis of the first 2 eigenspectra.
In general, these properties can be expressed as a
linear combination of each other, as seen in astrophysical
use, e.g. M*/L = a × Color + b (e.g. Baldry et al.).

In the same way, log SFR and μ50,r strongly resemble
the "W" shape of PC2. Some emission lines belonging
to this group are NII and Hα, the latter being a well-known
proxy for SFR (Kennicutt 1998). Also, M_r is a good
proxy for log M* (Bell et al. 2003), both of which are related
to R90_r and μ*,r, and seemingly correlate with R50_r.
Interestingly, the shape of the average Lick Na D index
is similar to the "W" shape of PC2 (and SFR), but
the larger 1σ dispersion in the average makes it not very
significant. Lick Na D is expected to represent a strong
absorption feature in old stellar populations, as shown
in the ellipticals at high arc lengths. We can however
see that it also presents a relatively high average at the
second turning point. This is related to the fact that
Na absorption is not only present in stars, but also in
the interstellar medium as a consequence of outflows
or winds present in spiral galaxies with high star formation
(Chen et al. 2010).

Note that the boomerang shape of R90_r/R50_r is almost
identical to PC3. Also correlating with the concentration
index is fracDeV_r (Stoughton et al. 2002), which
determines the mixing in the modeling of the light profiles
of galaxies between an exponential disk and a de Vaucouleurs
r^(1/4) law for the elliptical bulge.

6.2.1. The BPT Diagram 

Figure [7] shows the emission-line ratio BPT diagram
(Baldwin et al. 1981) of the MGS galaxies, where AGN
identification can be done easily. We considered galaxies
presenting Hα, Hβ, [NII] and [OIII] emission lines
with well measured equivalent widths, with fractional errors
smaller than 0.33 and velocity dispersion smaller
than 500 km s^-1. These cuts reduce the groups at the lowest
arc lengths (L1 onward) to ~90% of their size, going down to 35% for
the remaining groups at higher arc length. Since none
of these emission lines were included in the building of
the WPCA, we overplotted the average location of the
{Li} groups.

The average locations of the groups can be seen to be
connected by a two-branched track in line-ratio space.
The left branch covers the region of star forming galaxies,
whereas the right branch crosses the separator of
Kauffmann et al. (2003b) into the region filled by AGNs.
Interestingly, the joining point between the 2 branches
happens at L14, which contains the 2nd turning point in
the principal curve. This is a striking feature, as it shows
that the P-curve is powerful enough to describe galaxy
properties beyond the ones included in its construction.
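As an illustration of how a position in the BPT plane maps to the star forming, composite and AGN regions, the sketch below uses the commonly quoted forms of the Kauffmann et al. (2003) and Kewley et al. (2001) demarcation curves in the log [NII]/Hα versus log [OIII]/Hβ plane. It is not taken from this paper's pipeline; the function names are hypothetical.

```python
import numpy as np

def kauffmann03(x):
    """Empirical star forming boundary, meaningful for x = log([NII]/Ha) < 0.05."""
    return 0.61 / (x - 0.05) + 1.3

def kewley01(x):
    """Maximum-starburst boundary, meaningful for x = log([NII]/Ha) < 0.47."""
    return 0.61 / (x - 0.47) + 1.19

def bpt_class(log_nii_ha, log_oiii_hb):
    """Label each point as 'star-forming', 'composite' or 'agn'."""
    x = np.atleast_1d(np.asarray(log_nii_ha, dtype=float))
    y = np.atleast_1d(np.asarray(log_oiii_hb, dtype=float))
    with np.errstate(divide="ignore", invalid="ignore"):
        below_ka = (x < 0.05) & (y < kauffmann03(x))
        below_ke = (x < 0.47) & (y < kewley01(x))
    labels = np.full(x.shape, "agn", dtype=object)
    labels[below_ke] = "composite"      # below Kewley, refined on the next line
    labels[below_ka] = "star-forming"   # below Kauffmann as well
    return labels
```

Evaluating `bpt_class` on the group-averaged line ratios would show at which group the track crosses from the star forming region into the composite and AGN regions.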

6.3. Luminosity Functions 
Figure [8] shows the luminosity functions (LFs) corresponding
to the equal number density groups.
They can be directly compared to the evolution of M_r
as a function of arc length in Fig. [6]. The LFs were
Fig. 7. BPT diagram of the MGS galaxy sample. Symbols
connected with straight lines track the positions of the average
log [NII]/Hα and log [OIII]/Hβ of each {Li} (i = 1, ..., 20) group (1σ dispersion
bars also included). Colored dots are a random 2% sample of
each group. Dashed and dotted lines show the separators from
Kauffmann et al. (2003b) and Kewley et al. (2001), respectively,
between pure star forming galaxies (left region), composite (central)
and AGN (right).



TABLE 4 (a)

Group   φ* × 10^4 (b)    M*               α               ξ
All     4.90 ± 0.14      -21.30 ± 0.03    -0.91 ± 0.02
L1      0.13 ± 0.02      -20.45 ± 0.12    -1.60 ± 0.05
L2      0.19 ± 0.03      -19.90 ± 0.13    -1.54 ± 0.07
L3      0.27 ± 0.08      -19.48 ± 0.21    -1.56 ± 0.13
L4      0.36 ± 0.08      -19.21 ± 0.15    -1.62 ± 0.10
L5      0.85 ± 0.21      -18.45 ± 0.18    -1.34 ± 0.18
L6      0.68 ± 0.19      -18.54 ± 0.17    -1.69 ± 0.15
L7      1.50 ± 0.10      -18.07 ± 0.07    -0.87 ± 0.10
L8      1.13 ± 0.09      -18.34 ± 0.11    -0.65 ± 0.16
L9      0.93 ± 0.05      -18.86 ± 0.06    -0.76 ± 0.07
L10     0.96 ± 0.04      -19.20 ± 0.06    -0.42 ± 0.08
L11     1.04 ± 0.01      -19.41 ± 0.04     0.03 ± 0.06
L12     1.00 ± 0.01      -19.85 ± 0.04     0.12 ± 0.06
L13     0.99 ± 0.01      -20.34 ± 0.02     0.06 ± 0.03
L14     0.95 ± 0.01      -20.68 ± 0.02    -0.07 ± 0.02
L15     0.69 ± 0.01      -20.92 ± 0.03    -0.46 ± 0.03
L16     0.16 ± 0.02      -22.01 ± 0.15    -1.16 ± 0.04    -0.32 ± 0.02
L17     0.016 ± 0.004    -23.15 ± 0.28    -1.69 ± 0.03    -1.167 ± 0.004
L18     0.41 ± 0.09      -20.69 ± 0.24    -0.93 ± 0.14

Group       φ* × 10^4 (b)    μ_M              σ_M
L19         8.52 ± 0.19      -20.46 ± 0.02    0.78 ± 0.02
L20         9.00 ± 0.11      -21.34 ± 0.01    0.70 ± 0.01
Red Spine   0.55 ± 0.02      -21.51 ± 0.02    0.46 ± 0.01

(a) Fittings to Eqns. [3] and [4]. The parameters
in magnitude space can be expressed as luminosities using
M_r = -2.5 log10(L/L⊙) + M⊙,r, where M⊙,r = 4.62
(Blanton et al. 2001).
(b) φ* in units of Mpc^-3.

computed with the V_max method of Schmidt (1968)
explained in Sec. 5.1, where the estimated LF value at each
magnitude bin is the sum of the weights of all galaxies in
that bin, with w_i = 1/V_max,i. Table [4] contains the fitting
parameters, choosing the fitting functions to be:

• Double power-law:

\Phi(L)\,dL = \phi^{*} \left(\frac{L}{L^{*}}\right)^{\alpha} \left[1 + \xi \frac{L}{L^{*}}\right]^{-1/\xi} \frac{dL}{L^{*}} \qquad (3)

granted that 1 + ξ(L/L*) > 0. This double power-law
fit (Alcaniz & Lima 2004) collapses when ξ = 0
into the Schechter fit (Schechter 1976) Φ(L)dL =
φ* (L/L*)^α exp(-L/L*) dL/L*. The ξ parameter
can be related to the tail index in extreme value
statistics (e.g. Gumbel 1958; Galambos 1978), defining
for the brightest luminosities an infinitely reaching
power law tail (ξ > 0), an exponential tail (ξ = 0)
or a cutoff at a finite maximum luminosity L_max =
L*/|ξ| (ξ < 0).

• Log-normal:

\Phi(L)\,dL = \frac{1}{L\,\sigma_L \sqrt{2\pi}} \exp\left(-\frac{(\ln L - \mu_L)^2}{2\sigma_L^2}\right) dL \qquad (4)

Note that a Log-normal distribution in luminosity
space is equivalent to a normal (Gaussian) distribution
in magnitude space.

Fig. 8. Luminosity functions of the groups ordered by increasing arc length (blue triangles with a dashed line fit from Table
[4]). Horizontal axis: M_r. The aggregated luminosity function of all the Li samples is shown as black circles with the Schechter fit as a black continuous line.
The red diamonds denote the luminosity function belonging to the group of red galaxies located very close to the principal curve within
the L20 group (see Sec. 6.5.2).

The luminosity function of the whole sample is well fitted
by the Schechter fit, except for the bumps at the high
luminosity tail and at the low luminosity end starting
at M_r ~ -18.8, also observed by Blanton et al. (2003)
and Blanton et al. (2005). We fitted it right before the
bump.

The changes in the behavior of the luminosity function
along the P-curve are also determined by the 3
turning points. In summary, M* becomes fainter in
the 1st branch, and then brighter afterward. On the
other hand, the slope of the faint tail behaves similarly
to PC2. In fact, it is steep in the 1st branch,
becomes shallower in the 2nd branch, and is steep again
in the 3rd branch. The 4th branch contains an
extremely shallow slope, where the luminosity functions
mostly resemble a log-normal distribution. We can see
that we recover the luminosity function shapes shown in
Binggeli, Sandage & Tammann (1988), which are based
on morphological types, and vary between Schechter fits
(as in L1 to L15) and bell-shaped luminosity functions
fitted by Log-normal fits (such as L19 and L20).

As we progress along the 1st branch of the principal
curve (L1 to L6), the blue compact dwarfs become
less luminous on average, with M* dimming by
about ΔM* ~ 2, with a mostly constant steep faint end
(α ~ -1.55). Note that these galaxies, and mostly the
low surface brightness spirals at the 1st turning point,
create the bump seen in the overall luminosity function.
In fact, L6 contains the faintest galaxies in our sample
(Fig. [5]). At this point, α = -1.69 gives the steepest
power law slope at the faint luminosity tail. Note that
this slope is expected to reach α ~ -1.5, as noted in
Blanton et al. (2005).

In the second branch (L7 to L13), the star forming
spirals present a faint end that flattens dramatically and
starts to drop continuously, with an increase of Δα ~ 1.6
from L7. At the same time, they start becoming much
more luminous, with M* brightening by ΔM* ~ -2.2
(from L6).

In the 3rd branch (L14 to L17), for the red spirals and
lenticulars M* continues becoming brighter (ΔM* ~
-2.5), but at the same time the faint-end slope starts
turning steep again (Δα ~ -1.6), back to the values of
α ~ -1.6 found at the end of the 1st branch. Note that
L16 (green maximum) and L17 (red maximum) show long
power-law faint ends with a sharp cutoff at the bright
end. They are better fitted by a double power-law fit,
and since ξ < 0 they present finite bright-end cutoffs at
M_r ~ -23.0 and -23.3 respectively.

The 3rd turning point (L18) is a unique case. The
luminosity function presents 3 power-law-like sections,
the faintest one being flat. We attempted to fit it with a
Schechter profile.

In the last 2 groups (L19 and L20), the faint end tail
has dropped enormously. We attempted Log-normal
fits for the luminosities, since the luminosity functions
look more bell-shaped, especially the ones belonging to
the {Ai} groups that track the spine.
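The fitting functions of this section can be written down compactly as follows. This is a sketch under stated assumptions rather than the fitting code used for Table [4]: the double power-law is taken in the form φ* x^α (1 + ξx)^(-1/ξ) with x = L/L*, which is the form consistent with the limits quoted in the text (Schechter for ξ = 0, cutoff at L*/|ξ| for ξ < 0), and all function names are hypothetical.

```python
import numpy as np

def schechter(x, phi_star, alpha):
    """Schechter form per unit x = L/L*: phi* x^alpha exp(-x)."""
    x = np.atleast_1d(np.asarray(x, dtype=float))
    return phi_star * x**alpha * np.exp(-x)

def double_power_law(x, phi_star, alpha, xi):
    """Assumed form phi* x^alpha (1 + xi x)^(-1/xi); zero where 1 + xi x <= 0."""
    x = np.atleast_1d(np.asarray(x, dtype=float))
    if xi == 0.0:
        return schechter(x, phi_star, alpha)
    base = 1.0 + xi * x
    out = np.zeros_like(x)
    ok = base > 0  # for xi < 0 this enforces the cutoff at x = 1/|xi|
    out[ok] = phi_star * x[ok]**alpha * base[ok]**(-1.0 / xi)
    return out

def lognormal(L, mu_L, sigma_L):
    """Log-normal density in luminosity, normalized to unit integral."""
    L = np.atleast_1d(np.asarray(L, dtype=float))
    norm = L * sigma_L * np.sqrt(2.0 * np.pi)
    return np.exp(-(np.log(L) - mu_L)**2 / (2.0 * sigma_L**2)) / norm

def mag_to_lum(M_r, M_sun_r=4.62):
    """L/Lsun obtained by inverting M_r = -2.5 log10(L/Lsun) + Msun_r."""
    return 10.0 ** (-0.4 * (np.asarray(M_r, dtype=float) - M_sun_r))
```

With ξ slightly different from zero, `double_power_law` approaches `schechter`, which is a quick numerical check of the limiting behavior described above.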

6.4. Galaxy clustering 

In this section, we investigate the second moment of
the galaxy spatial distribution, quantified by the clustering,
as a function of the arc length. We
explore here not only the dependence of the galaxy clustering
on L, but also the relative distribution
of galaxies as a function of L, which can be quantified by
the cross-correlation function.

Following common practice, we first compute the redshift
space correlation function, as a function of the distances
parallel (π) and perpendicular (r_p) to the line of
sight. We use a generalized version (Szapudi & Szalay
1998) of the Landy & Szalay (1993) estimator

\xi(r_p, \pi) = \frac{D_a D_b - D_a R_b - D_b R_a + R_a R_b}{R_a R_b} \qquad (5)

where the subscripts a and b refer to the two samples
we are considering when measuring the cross-correlation
function. We use the same methods as Heinis et al.
(2009) to compute the correlation functions. In brief, for
each sample we generate random catalogs following the
SDSS footprint defined by its sectors. We use 50 times
more random objects than galaxies. We reproduce the
selection function by randomly drawing redshifts from
the current sample. We correct for fiber collisions using
the method described in Heinis et al. (2009). Note that
the fiber collision correction applies only to the D_a D_b
term in Eq. (5).

As ξ(r_p, π) is sensitive to redshift distortions, we consider
the projected spatial correlation function afterward,
which is free from such effects:

w_p(r_p) = 2 \int_0^{\pi_{\max}} \xi(r_p, \pi)\, d\pi \qquad (6)

where we use π_max = 25 Mpc, for convergence purposes.
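The two steps above (the estimator of Eq. 5 followed by the line-of-sight integration of Eq. 6) can be sketched numerically as below. This assumes the pair counts have already been binned in (r_p, π) and normalized by the total number of pairs of each kind; the function names are hypothetical and the integral is approximated by a simple sum over equal-width π bins.

```python
import numpy as np

def landy_szalay_cross(dadb, darb, dbra, rarb):
    """Eq. (5): (DaDb - DaRb - DbRa + RaRb) / RaRb from normalized pair counts."""
    dadb, darb, dbra, rarb = (
        np.asarray(a, dtype=float) for a in (dadb, darb, dbra, rarb)
    )
    return (dadb - darb - dbra + rarb) / rarb

def projected_wp(xi_rp_pi, d_pi):
    """Eq. (6) as a sum: w_p(r_p) = 2 * sum_pi xi(r_p, pi) * d_pi.

    xi_rp_pi has shape (n_rp, n_pi), with pi bins of width d_pi
    covering [0, pi_max].
    """
    return 2.0 * np.sum(np.asarray(xi_rp_pi, dtype=float), axis=1) * d_pi
```

For an autocorrelation, the same estimator is used with a = b, in which case the two mixed terms are identical.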

We compute error bars on w_p(r_p) from jackknife resampling.
We build jackknife samples using the SDSS
stripes, which are defined to be 2.5 degree wide great circles
on the sky, following the survey latitude. In practice
we consider 23 jackknife samples built from the stripes.
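A minimal sketch of the jackknife error estimate follows, assuming the usual delete-one scheme in which each jackknife sample omits one stripe and w_p is remeasured on the rest. The source does not spell out the exact variance formula, so the standard (n - 1)/n prefactor used here is an assumption, and `jackknife_errors` is a hypothetical name.

```python
import numpy as np

def jackknife_errors(wp_jack):
    """Mean and 1-sigma errors from delete-one jackknife measurements.

    wp_jack has shape (n_jack, n_bins): row k holds w_p(r_p) measured
    with region k (here, one SDSS stripe) removed.
    """
    wp_jack = np.asarray(wp_jack, dtype=float)
    n = wp_jack.shape[0]
    mean = wp_jack.mean(axis=0)
    # Delete-one jackknife variance: (n - 1) / n * sum of squared deviations.
    var = (n - 1) / n * np.sum((wp_jack - mean) ** 2, axis=0)
    return mean, np.sqrt(var)
```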

We use a volume limited sample extracted from our
main sample (see Sec. ). In order to maximize the
signal-to-noise ratio of the clustering measurements, we do not
use the 20 samples in L, but 8 samples built in the following
way: we merged L1 to L6 into two groups with a similar
number of galaxies (Lw1 and Lw2), L7 to L10
into one group (Lw3), and grouped the remaining L samples
two by two (Lw4 = {L11, L12},
Lw5 = {L13, L14}, Lw6 = {L15, L16}, Lw7 = {L17, L18}
and Lw8 = {L19, L20}).

Figure [9] shows the results for the autocorrelation functions
of these samples in the diagonal plots. As a reference,
we show as a solid line in all diagonal plots the
autocorrelation function of the sample with the highest arc length
(Lw8). The results from the autocorrelation function
show that the amplitude of the correlation function at
large scales (r_p ~ 10 Mpc) increases with the arc length.
This implies that the host halo mass also increases with
l. This result is expected, as l correlates with the u - r
and g - r colors, for instance. It is indeed well known
that the amplitude of the correlation function increases
for redder objects in the local Universe (e.g. Zehavi et al.
2005, 2011). There are also some interesting features in
the small scale (r_p < 0.1 Mpc) clustering. Indeed, most
samples show clustering power at all scales, except our
groups 2 and 3 in particular, where there is an indication
of a lack of pairs at r_p < 0.1 Mpc, suggesting a
population mainly composed of central galaxies.

In the off-diagonal plots, we show the cross-correlation
functions between these samples. It is beyond the
scope of this paper to fully interpret all these measurements
with Halo Occupation Distribution models
(e.g. Cooray & Sheth 2002). We will use simple arguments
to highlight the information contained in the
cross-correlation function.

We represent as a solid line in all off-diagonal plots the
expected cross-correlation function given by

w^{x}_{ab}(r_p) = \sqrt{w_a(r_p)\, w_b(r_p)} \qquad (7)

where w_a(r_p) and w_b(r_p) are the autocorrelation functions
of samples a and b. Eq. (7) gives the
cross-correlation function which is expected in the case where
galaxies from the two samples are well mixed in the
dark matter halos hosting them (see e.g. Zehavi et al.
2005). This is of interest at small scales, where the
cross-correlation function contains information about close
pairs of galaxies that lie within the same dark matter
halos. Our results show an interesting trend in this context.
Indeed, the cross-correlation function of galaxies
with close arc lengths is similar to the expected
cross-correlation function from Eq. (7). On the other hand,
the cross-correlation of galaxies more distant in terms of
arc length diverges from the expected correlation function.
In particular, the measurements are overestimated by
Eq. (7) at scales r_p < 0.1 Mpc. This means that there
are fewer close galaxy pairs in the measurements than what is
expected in the case of a perfect mix between galaxies
of different arc length. Note that there is still a clustering
signal at these scales, which means that there are some
galaxies belonging to distant L groups in the same dark
matter halos. However, our results show that there are
fewer pairs than what is expected if galaxies are properly
mixed. This suggests that the probability that two
galaxies reside in the same dark matter halo decreases as
a function of their distance in arc length.

Fig. 9. Left: The diagonal panels show the autocorrelation function of the {Lwi} groups. The solid black line is the autocorrelation
function of Lw8 shown for reference. Off-diagonal panels show the cross-correlation functions between the groups, where the red solid line
represents the expected cross-correlation function when the galaxies from the 2 samples are well mixed in the dark matter halos.
Right: Same layout as in the figure on the left, but showing the ratio of the measured cross-correlation function to the expected one
when the galaxies from the 2 samples are well mixed in the dark matter halos.
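The comparison made in this subsection amounts to dividing the measured cross-correlation by the geometric mean of the two autocorrelations. A minimal sketch, with hypothetical function names and assuming all three functions are sampled on the same r_p bins:

```python
import numpy as np

def expected_cross(wp_a, wp_b):
    """Well-mixed expectation: geometric mean of the two autocorrelations."""
    return np.sqrt(np.asarray(wp_a, dtype=float) * np.asarray(wp_b, dtype=float))

def mixing_ratio(wp_ab, wp_a, wp_b):
    """Measured over expected cross-correlation; values below 1 at small r_p
    indicate fewer close pairs than in the well-mixed case."""
    return np.asarray(wp_ab, dtype=float) / expected_cross(wp_a, wp_b)
```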

6.5. Interesting Galaxy Groups found from WPCA and 
Principal curve classification 

The analysis of galaxy properties with the WPCA and P-curve
methods allowed us to find and pinpoint some relevant
groups that stand apart from the main trends of the
whole galaxy sample. In particular, here we pay attention
to the small blob of galaxies that clusters apart from
the main cloud in Fig. [2]. Another interesting task is to
isolate a pure population of red galaxies in the L20 group
whose luminosity function is Log-normal, as it appears
in Fig. [8].

6.5.1. Blob of small red disk galaxies/lenticulars of high
M*/L_r and μ*,r

In Fig. [2] we found a small blob of galaxies clustered
apart from the main cloud in the PC3 vs. PC4 panel.
We separated them by using the following separating line
(built by eye): PC4 < -1.3 + 0.55 (PC3 - 2.0). We further
constrained these galaxies by choosing an appropriate
interval in arc length [l_min, l_max] = [11.6, 12.6], which
includes most of the L18 and part of the L19 groups. In fact,
the blob also appears in Figure [6], bracketed within this
arc length range in the R90_r/R50_r, μ*,r and log R90_r
panels. Precisely, R90_r/R50_r is the property that dominates
in PC3 and PC4 with opposite signs, as shown
in Fig. [1]. After the selection cuts, we are left with 136
blob galaxies, whose imaging and average spectrum are
shown in Fig. [10].

According to Table [3], the blob is composed mostly
of small red disk galaxies, with minimal star formation
and void of gas. In fact, u - r is at least 1σ redder
than the average in the L18 group. Furthermore,
they are modeled with an important component of an
exponential disk (fracDeV_r ~ 0.00). The concentration
index R90_r/R50_r ~ 1.95 is low and far from
the often used R90_r/R50_r = 2.6 separator between
ellipticals and spiral galaxies (Strateva et al. 2001). The
average size of R90_r ~ 1.6 kpc is small compared to
that of L18 (R90_r ~ 4.0 kpc), lying beyond the 1σ
significance level. Interestingly, the small size makes
M*/L_r ~ 4.07 M⊙/L⊙,r, μ50,r = 18.83 mag arcsec^-2 and
μ*,r = 10^10.06 M⊙ kpc^-2 also lie well above 1σ of their
average in L18. With respect to the uncertainty of these
values, errors in the estimate of M* can arise when fitting
the spectra, especially when there is a strong component
of dust, which these galaxies appear to have. The
statistical error in M* is around 15%. When comparing to
a catalog of groups from Tago et al. (2010), we found
that at least half of these galaxies are in groups of more
than 10 members. We speculate that they were depleted
of gas by ram pressure stripping or another mechanism
that did not perturb the structure of the disk.

Fig. 10. Equivalent to Fig. [5], but showing interesting groups
derived from the WPCA and P-curve analysis.
Top: Panels with the 4 most representative galaxy shapes for
the small red disk-like galaxies blob (left) and red spine galaxies
(right).
Bottom: Average spectrum for the 2 previous groups (flux versus wavelength λ in Å).

6.5.2. Close-to-spine Red galaxies of Log-Normal luminosity function.

In Fig. [8] we showed that the L20 group of red galaxies
has a luminosity function close to a log-normal distribution,
clearly different from the Gamma and double power-law
distributions of the other groups. For further study,
we wanted to isolate the galaxies in this group whose
luminosity function is exactly log-normal.

In order to extract these galaxies, we choose the ones
falling in {A98, A99, A100}, which are the last 3 subgroups
of L20 as explained in Sec. 5.3. We further chose the 1st partition
closest to the P-curve, out of the 10 radial partitions
in each Ai group, therefore selecting galaxies on or very
close to the spine of the data point cloud.

The selected red spine galaxies are part of the very
high density core of the red sequence of galaxies found
in L20. In fact, Table [3] and Fig. [6] show that their
property values are very similar to the averages of the
whole L20. They are mostly red ellipticals (u - r ~ 2.62,
fracDeV_r ~ 1 and R90_r/R50_r ~ 3.23), of mass M* ~
7.08 × 10^10 M⊙ and luminosity L_r ~ 2.86 × 10^10 L⊙,r.
Figure [8] shows that their luminosity function is in fact
very close to log-normal, with parameters shown in Table
[4].

7. DISCUSSION 

The unsupervised nonparametric methods of WPCA
and P-curve should not be considered useful only for
dimensionality reduction and easy data visualization. In
this paper, these methods also proved able to
provide supporting evidence for some physical models
and scenarios relevant in extragalactic astronomy, as
discussed next.

7.1. Information Content of the Principal Curve 

The principal curve provides an objective way of
ordering galaxies along its arc length. The success in
dimensionality reduction and the classification power of the
P-curve are related to how much the projections of the
galaxies are spread along arc length. In fact, a large variance
along arc length gives more room for building separators
and discerning different galaxy types. In our case,
the arc length values along the P-curve have a variance
of $\sigma^2 = 7.79$ (see Table [?]), larger than the cumulative
variance $\sum_i \sigma^2_{PC_i} = 6.81$ of the principal components it was
built from. This means that the curvature of the P-curve
helps to discern information not captured by the intrinsic
linearity of WPCA. The length of the P-curve cannot be
made arbitrarily long or short, due to an evident
bias-variance trade-off. The shortest curves (with no
curvature) are identical to PC1, with an average spread across
it equal to $\sum_{i>1} \sigma^2_{PC_i}$ and a high bias due to the straight
P-curve missing the important bends in the structure
of the cloud of points. On the other hand, the longest
curves possible would be the ones connecting all the data
points, which produce a null bias but high variance, as
the curve will fit the noise in the structure of the cloud.
In fact, when we experimented with values of df ~ 7, the
curve attempted to cover all the space spanned by the
cloud, twisting and coiling itself in ways that describe
additional detailed features of galaxies, whereas here we are
interested in the global trends. Our choice of df = 5.4 is
an intermediate case, where the root mean square of the
projection distances on the curve is $d_\perp = 1.31$, smaller
than $(\sum_{i>1} \sigma^2_{PC_i})^{1/2} = 1.52$. The ratio
$1 - d_\perp^2 / \sum_i \sigma^2_{PC_i} = 0.75$
gives us a notion of the amount of information that the
P-curve is able to discern. The physical origin of the
scatter in the remaining 25% is still to be explored, and
depends locally on the direction in the eigenspace along
which $d_\perp$ is measured.
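
As a quick arithmetic check of the numbers quoted in this section, the short sketch below recomputes the information fraction. It assumes that the ratio is defined as one minus the mean squared projection distance divided by the cumulative variance of the principal components; this reading is an assumption, chosen because it reproduces the quoted 0.75.

```python
# Values quoted in the text.
var_total = 6.81    # cumulative variance of the principal components
d_perp = 1.31       # rms projection distance onto the P-curve
spread_pc1 = 1.52   # rms spread across PC1 (straight-line case)

# Assumed definition: fraction of the total variance that is not left
# in the residuals around the curve.
info_fraction = 1.0 - d_perp**2 / var_total
print(round(info_fraction, 2))  # 0.75

# Residual variance of the curve relative to that of the straight line PC1.
print(round((d_perp / spread_pc1) ** 2, 2))  # 0.74
```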

7.2. Explaining the Zoo of Galaxies 

In our analysis, the P-curve has been able to recover
the well-known bimodality between the blue and red
populations. Since galaxy properties are highly correlated,
only a few properties should be enough for explaining
the variations in the zoo of galaxies, namely
$u - r$ (from PC1), SFR (from PC2) and, less importantly,
$R90_r/R50_r$ (from PC3). In fact, the variations recovered
by the P-curve and its "W" shape depend strongly
on SFR. The color $u - r$, almost linearly correlated
with arc length, tracks specifically the 4000 Å break in
the continuum, which gives a measure of stellar age and
separates the early from the late galaxy types. However, this
is not enough, as $R90_r/R50_r$ tells about morphology,
whereas the SFR tracks the amount of material produced
in recent starbursts, which shows as the strength
of emission lines such as the Balmer series, and eventually
correlates with the galaxy size, mass and luminosity.
For example, within the blue population (bluer than the
green maximum) we can find the low star-formation and
surface brightness spiral galaxies separating the bluest
star-forming spheroidals/irregulars and the redder
star-forming spirals with a prominent bulge. The importance
of the color and emission lines from star formation in
explaining variations in galaxy populations has also
appeared in previous PCA studies (e.g. Yip et al. 2004b;
Coppa et al. 2011; Győry et al. 2011).

7.3. Additional Evidence supporting some Physical Models

7.3.1. AGN Activity and Star Formation Quenching 

The P-curve presents a green density maximum,
between the blue and red ones. The green maximum shows
an interesting feature in PC1 (Fig. [?]), colors and $M_*/L_r$
as a function of arc length (Fig. [?]). The averages of these
properties keep on increasing as the arc length increases,
except at the green maximum in L16, where they stay
constant or even decrease. This behavior,
however, is not seen for example in SFR/$M_*$, whose
average continues decreasing monotonically at L16. Note
that L16 is the last group which shows significant star
formation and/or emission lines (see Fig. 5). Indeed, the
equivalent width of H$\alpha$ drops by 1 dex (see Table 2) when
moving to L17; this last group is consistent with no H$\alpha$
emission given the 1 Å resolution of SDSS spectra.
Furthermore, L16 is the last group in the pure star-forming
branch of the BPT diagram (Fig. 7), right before
the bordering region separating the pure star-forming
and composite regions. Thus, higher arc length groups
have basically small to null star formation activity and
contain AGN. This is in agreement with the finding that
AGN activity might be the cause for the shutdown of star
formation in these galaxies (Martin et al. 2007).
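
For readers unfamiliar with the BPT regions mentioned above, the following sketch shows one common way of assigning a galaxy to the star-forming, composite or AGN region from its emission line ratios. It assumes the widely used demarcation curves of Kauffmann et al. (2003) and Kewley et al. (2001) in the [NII]/H$\alpha$ versus [OIII]/H$\beta$ plane; the text does not state which curves were used, and the function name `bpt_class` is hypothetical.

```python
def bpt_class(log_nii_ha, log_oiii_hb):
    """Classify one galaxy from its log10 emission line ratios."""
    x, y = log_nii_ha, log_oiii_hb
    # Kauffmann et al. (2003) empirical star-forming boundary (x < 0.05).
    below_kauffmann = x < 0.05 and y < 0.61 / (x - 0.05) + 1.3
    # Kewley et al. (2001) maximum-starburst boundary (x < 0.47).
    below_kewley = x < 0.47 and y < 0.61 / (x - 0.47) + 1.19
    if below_kauffmann:
        return "star-forming"
    if below_kewley:
        return "composite"
    return "AGN"

# Illustrative line ratios (not measurements from the paper).
for ratios in [(-0.5, -0.5), (-0.2, 0.0), (0.2, 0.5)]:
    print(ratios, bpt_class(*ratios))
```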

7.3.2. Hierarchical Model of Galaxy Formation 

The luminosity functions shown in Fig. 5 can be classified
roughly into gamma and log-normal distributions.
It has been shown, e.g. by Cooray & Milosavljević (2005)
and Yang et al. (2009), that the Schechter fit
for LFs can be divided into several components, coming
from 2 different populations in dark matter halos:
central or brightest cluster galaxies (BCGs) and satellite
galaxies. Satellite galaxies are often given a power law
LF, with a finite cut at the bright end given by the
luminosities of the central galaxies. The centrals, on the
other hand, are given bell-shaped luminosity functions.
In particular, high mass halos ($M_h > 10^{13}\,M_\odot$, according
to Cooray & Milosavljević 2005) contain central
galaxies whose luminosity functions can be modeled as
a log-normal distribution. This is exactly the behavior
observed for the L19 and L20 groups in Fig. 8 and better
seen for the red spine galaxies shown in Sec. 6.5.2. Note
that L19 and L20 correspond to $L^w_8$ in Fig. 9, which
appears to have an autocorrelation function with stronger
power at $r_p = 10$ Mpc than any other group. On the other
hand, at $r_p < 0.1$ Mpc there is a clear loss of power. This
is consistent with L19 and L20 being mostly composed
of central galaxies. Note that there is still power in the
autocorrelation function of $L^w_8$ at $r_p < 0.1$ Mpc, which
shows that there are some satellite galaxies (mostly red)
in this sample, consistent with the faint end tail
of the LFs in Fig. 8.
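
The decomposition described above can be illustrated with a toy model: a log-normal term for the central galaxies plus a power law for the satellites, truncated at the central luminosity. This is a minimal sketch only. The function names and every parameter value (`L_c`, `sigma_dex`, `slope`, `norm`) are hypothetical and are not fits from this paper or from the cited works.

```python
import numpy as np

def central_lf(L, L_c, sigma_dex):
    """Log-normal distribution per unit log10(L), peaked at L_c."""
    x = np.log10(np.asarray(L, dtype=float) / L_c)
    return np.exp(-0.5 * (x / sigma_dex) ** 2) / (np.sqrt(2 * np.pi) * sigma_dex)

def satellite_lf(L, L_c, slope, norm):
    """Power law in L, cut off above the central luminosity L_c."""
    L = np.asarray(L, dtype=float)
    return np.where(L < L_c, norm * (L / L_c) ** slope, 0.0)

# Hypothetical parameters, for illustration only.
L_c, sigma_dex, slope, norm = 3e10, 0.25, -0.5, 0.3
L = np.logspace(8, 11.5, 8)
cen = central_lf(L, L_c, sigma_dex)
sat = satellite_lf(L, L_c, slope, norm)
total = cen + sat

for lum, c, s in zip(L, cen, sat):
    print(f"L = {lum:9.2e}  central = {c:9.2e}  satellite = {s:9.2e}")
```

In this toy model the faint end is dominated by the satellite power law and the bright end by the log-normal centrals, which is the qualitative behavior the paragraph describes.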

Log-normal distributions appear in nature as a consequence
of multiplicative processes (Limpert et al. 2001;
Mitzenmacher 2003 and references therein), where
the initial value $Y_0$ of a random variable is changed
in successive steps in the form $Y_j = F_j Y_{j-1}$ by i.i.d.
multiplicative factors $F_j$ of distribution $P(F)$. Using
the central limit theorem for $j \to \infty$, it can be shown
that $Y$ follows a log-normal distribution, independent
of $P(F)$. This argument can be extended to explain
the log-normal luminosity functions of central galaxies 
and their stellar mass functions as well, since the 
r-band luminosity traces a population of old stars, 
which form the bulk of the mass of galaxies (Bell et al.
2003). In fact, hierarchical galaxy formation models (e.g.
Steinmetz & Navarro 2002; De Lucia & Blaizot 2007)
explain the creation of massive elliptical BCGs as a 
series of dry mergers of existing galaxies. Thus, a dense 
environment will allow several steps of mass adding or 
stripping that might lead to the formation of BCGs and 
cause the log-normal mass distributions for them. 
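
The multiplicative argument can be checked numerically. Since $\log Y_j = \log Y_0 + \sum_k \log F_k$ is a sum of i.i.d. terms, the central limit theorem makes $\log Y_j$ approximately normal, provided $\log F$ has finite variance. The sketch below is a toy simulation, not a model of galaxy mergers: the uniform factor distribution on [0.8, 1.25], the number of steps and the sample size are all arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(42)
n_objects, n_steps = 20000, 100

# Hypothetical factor distribution P(F): uniform on [0.8, 1.25].
factors = rng.uniform(0.8, 1.25, size=(n_objects, n_steps))

# Y_j = F_j * Y_{j-1} with Y_0 = 1, accumulated in log space.
log_y = np.log(factors).sum(axis=1)

def skewness(x):
    """Sample skewness (third standardized moment)."""
    z = (x - x.mean()) / x.std()
    return float(np.mean(z ** 3))

# log Y should be nearly symmetric (normal), while Y itself is strongly skewed.
print("skewness of log Y:", skewness(log_y))
print("skewness of Y:    ", skewness(np.exp(log_y)))
```

Swapping the uniform draw for another positive distribution with finite log-variance should leave the near-zero skewness of `log_y` unchanged, which is the sense in which the result is independent of $P(F)$.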